[WIP] New Zig formal grammar #1685
Thanks for doing this work.
try (switch (c) {
This seems worse. Why is this necessary?
if (base.id == (comptime typeToId(T))) {
Same question.
link_err: errorset{OutOfMemory}!void,
Can we keep
return HashInt(unsigned_x) ^ (comptime rng.scalar(HashInt));
Would this work?
pub const LPOVERLAPPED_COMPLETION_ROUTINE = ?(extern fn (DWORD, DWORD, *OVERLAPPED) void);
This seems worse. Why is this needed?
assert(1234 == (switch (x) {
    MultipleChoice.A => 1,
    MultipleChoice.B => 2,
    MultipleChoice.C => u32(1234),
    MultipleChoice.D => 4,
}));
Same question.
if (t or (x: {
    assert(f);
    break :x f;
})) {
Same question.
I agree that 1 token lookahead is not an important goal to reach. I will happily trade a couple token lookaheads for any other syntactic gain.
Let's keep the dependencies of Zig at a minimum, that is, a system C++ compiler, libLLVM, and libclang. However, I would be open to a separate repository dedicated to testing the Zig grammar, which we could have the CI run on every commit.
The reason switch and comptime require parens is that I gave them a high precedence, together with the other control flow expressions. For comptime, it was done because of this mess:
// A lower precedence `comptime` Expr rule would cause this ambiguity:
async<comptime A> fn()void
//async<(comptime A> fn()void)
//async<(comptime A)> fn()void
I think we could also solve this with
For switch and blocks, this was done to correctly formalise the rule that these statements shouldn't have semicolons behind them:
// Prev rules:
// Statement
// : SwitchExpr
// | Expr Semicolon
// ...
// PrimaryExpr
// : SwitchExpr
// ...
// All these are valid, with the former grammar
switch (a) {}
switch (a) {};
{}
{};
// It would also cause this ambiguity
{}{}; // Is this two blocks, or a block followed by an initializer
switch (a) {}{}; // Same with switch
The requirement of parens around fn types is to resolve this:
fn()fn()void!void
//fn()(fn()void!void) // This is how it is parsed with the new grammar
//fn()(fn()void)!void
// This solution requires this grammar
// FnTypeExpr
// : ErrorUnionExpr
// | FnTypePrefix ErrorUnionExpr // Just noticed, that this ErrorUnionExpr should be a FnTypeExpr
//
// ErrorUnionExpr
// : PrefixExpr
// | PrefixExpr ExclamationMark PrefixExpr
//
// PrefixExpr
// : SuffixExpr
// | PrefixOp PrefixExpr // To have []fn()void, parens are required
I'll look more into relaxing these paren requirements. If you have any ideas, I'm all ears :)
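A minimal sketch of the three rules quoted above, written as a recursive-descent parser in Python. It uses the corrected form of the first rule (the one the inline comment says should recurse into FnTypeExpr). Two things are assumptions made for illustration: `fn()` is treated as a single token, and `[]` is the only prefix operator. The function names are hypothetical and are not taken from either Zig parser.

```python
def parse_fn_type(tokens, pos=0):
    # FnTypeExpr: FnTypePrefix FnTypeExpr | ErrorUnionExpr
    if tokens[pos:pos + 1] == ["fn()"]:
        ret, pos = parse_fn_type(tokens, pos + 1)
        return ("fn", ret), pos
    return parse_error_union(tokens, pos)


def parse_error_union(tokens, pos):
    # ErrorUnionExpr: PrefixExpr | PrefixExpr '!' PrefixExpr
    left, pos = parse_prefix(tokens, pos)
    if tokens[pos:pos + 1] == ["!"]:
        right, pos = parse_prefix(tokens, pos + 1)
        return ("!", left, right), pos
    return left, pos


def parse_prefix(tokens, pos):
    # PrefixExpr: PrefixOp PrefixExpr | SuffixExpr
    tok = tokens[pos]
    if tok == "[]":
        inner, pos = parse_prefix(tokens, pos + 1)
        return ("[]", inner), pos
    if tok == "(":
        # Parentheses restart at the lowest-precedence rule.
        inner, pos = parse_fn_type(tokens, pos + 1)
        if tokens[pos:pos + 1] != [")"]:
            raise SyntaxError("expected ')'")
        return inner, pos + 1
    if tok == "fn()":
        raise SyntaxError("fn type needs parentheses in this position")
    return tok, pos + 1  # a plain type name


tree, _ = parse_fn_type(["fn()", "fn()", "void", "!", "void"])
print(tree)  # ('fn', ('fn', ('!', 'void', 'void')))
```

With these rules the error union binds tighter than the fn prefix, so `fn()fn()void!void` groups as `fn()(fn()(void!void))`, and `[]fn()void` is rejected unless it is written as `[](fn()void)`.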
I like the separate repo idea btw. Where do we keep the grammar? In the Zig or Zig-grammar repo?
We could, but what about |
In the separate repo I think. ziglang/zig is an implementation of the zig specification (which isn't written yet; see #75) using recursive descent, and the grammar repo would be a tool used for validating and testing the formal grammar specification.
We could make that continue to work with special syntax, yes? It's always been syntactic sugar for |
Right, we could do that. Was trying to keep the |
Since we're formalizing the grammar I'd like to suggest allowing separators in numeric literals to improve readability. I did some research and C++14 uses single quote Of course there is at least one language where it was discussed and rejected, Go here and here.
'>' acts exactly like '{' in this case. In some places an expression cannot contain '{' because it always starts the function body. Here '>' always closes async. In both cases you can allow the use of parentheses to have the code parsed differently. You could always unify these simpler expressions, disallowing both '{' and '>' in both cases. The same could be said about '[' and ']'. When inside '[' the next ']' does not apply to the current expression but to the parent. Ignoring the fact that Zig does not have a ']' operator, but that's beside the point.
Maybe comptime(expr) would also work fine as was suggested. But I think the strategy above might be something to keep in mind when these issues pop up. |
From the compiler writer's POV, I think it's really just about keeping track of what token stops the current expression and returns to the parent. Sometimes it's ']', sometimes it's ',', sometimes it's '{', sometimes it's '>'. |
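The stop-token idea can be sketched in a few lines of Python. This is an illustration only, not code from either Zig parser: `take_expr` is a hypothetical helper, tokens are plain strings, and only `(` and `[` are treated as nesting.

```python
OPEN = {"(": ")", "[": "]"}


def take_expr(tokens, pos, stop):
    """Consume tokens from `pos` until one in `stop` appears outside any
    parentheses or brackets. Returns (expression_tokens, next_pos)."""
    closers = []  # closing tokens we are still waiting for
    start = pos
    while pos < len(tokens):
        tok = tokens[pos]
        if not closers and tok in stop:
            break  # the parent construct owns this token
        if tok in OPEN:
            closers.append(OPEN[tok])
        elif closers and tok == closers[-1]:
            closers.pop()
        pos += 1
    return tokens[start:pos], pos


# '>' ends the expression at the top level...
print(take_expr(["a", ">", "fn"], 0, {">"}))
# ...but not inside parentheses.
print(take_expr(["(", "a", ">", "b", ")", ">", "fn"], 0, {">"}))
```

The caller decides what goes in `stop`: `{">"}` inside `async<...>`, `{"]"}` inside an index, `{"{"}` where a body must follow, and so on.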
This is not a problem with
The grammar is ambiguous between an async call and an async function type. |
Yes, my bad. I edited my answer.
I think this is a nice attempt at making the grammar context free but now that the grammar is simpler for machines it needs to be refined for humans. These things stick out to me as being very awkward:
I shortened some of the identifier names. |
@UniqueID1 |
@UniqueID1 A language is a set of strings. In this case, the set of all syntactically legal Zig programs. A grammar is a structured way of representing a language. One property of a language (but not a grammar) is whether it is context free. If you can write a grammar for Bison, then the language it parses is context free. BNF grammars define languages unambiguously, but not parse trees. Bison emits a warning when there are two different ways to interpret the same input. This does not affect which inputs are accepted, but rather why they are accepted (that is, not the question "is this a legal program?", but rather "what is the structure of this program?"). It is not really enough to write a grammar, but let the resolution of ambiguous parses depend on a hand-written parser. Now you have to understand the parser program; and additionally, if the parser program does not correctly implement the language defined by the grammar, you're sunk. |
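To make the distinction between "is this string accepted?" and "what is its structure?" concrete, here is a small Python sketch. It uses a toy grammar that is not from the Zig grammar, `E -> E '-' E | 'n'`, and enumerates every parse tree for a token list. The function name `parses` is hypothetical.

```python
def parses(tokens):
    """Return every parse tree for the toy grammar E -> E '-' E | 'n'."""
    if tokens == ["n"]:
        return ["n"]
    trees = []
    # Try each '-' as the top-level operator.
    for i, tok in enumerate(tokens):
        if tok == "-":
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    trees.append((left, "-", right))
    return trees


for tree in parses(["n", "-", "n", "-", "n"]):
    print(tree)
# ('n', '-', ('n', '-', 'n'))
# (('n', '-', 'n'), '-', 'n')
```

The input is accepted either way, so the language is unaffected, but the two trees mean different things for subtraction. This is the kind of situation that a parser generator such as Bison reports as a conflict.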
@wirelyre Thanks. I deleted my comment. |
Alright, here are the two ways we can choose for
or
The simple grammar can't have both, because it is ambiguous:
We can special-case certain things in the body of the if, to allow for return of simple expressions, but the more we special-case, the harder it will be to explain the associativity and precedence of operators (
I've thought long and hard about this, and I don't think we can make this grammar context-free and unambiguous without sacrificing some of the niceness of the current syntactic constructs, or resorting to a special priority system outside the grammar (which kinda defeats the point). I propose that we instead write the grammar as a parsing expression grammar (PEG). The pro is that the grammar cannot be ambiguous, as only the first matching rule will be considered. The con is that PEGs hide syntax flaws. Here is how I imagine all the current ambiguities will be parsed if we have a PEG:
Also, we can't have the semicolon rule with PEG if we wanna keep the block expression having high priority:
|
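The ordered-choice behaviour behind both the pro and the con in the PEG proposal above can be sketched with three tiny Python combinators. This is an illustration only: `lit`, `seq` and `choice` are hypothetical helpers, not part of the peg tool used for the Zig grammar, and each parser simply returns the end position of a match or `None`.

```python
def lit(s):
    """Match the literal string `s` at the current position."""
    def parser(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parser


def seq(*parsers):
    """Match every parser in order; fail if any of them fails."""
    def parser(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parser


def choice(*parsers):
    """PEG ordered choice: the first alternative that matches wins."""
    def parser(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end  # later alternatives are never tried
        return None
    return parser


short_first = choice(lit("a"), lit("ab"))
long_first = choice(lit("ab"), lit("a"))
print(short_first("ab", 0))  # 1
print(long_first("ab", 0))   # 2

# Once the choice has committed to "a", the following "c" cannot match.
print(seq(short_first, lit("c"))("abc", 0))  # None
```

There is never more than one result, so the grammar cannot be ambiguous. The cost is visible in `short_first`: the `"ab"` alternative can never match, and nothing warns about it, which is one way a PEG hides a flaw that a CFG tool would report as a conflict.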
It seems like getting rid of <> as grouping operators would solve some problems. C++ has some nasty syntax due to I know the semantics of async are planned to change in the near future. Perhaps those changes will yield different syntax. I've always been uneasy with <> as grouping operators, so avoiding them entirely seems promising. |
In #661 we need a grouping operator to pass a calling convention expression to the |
I think this is the way to move forward. 👍 from me. This is always how I imagined the grammar working. I believe that I have incorrectly been using the term "context-free grammar" when I only meant "you can make a parse tree without doing any semantic analysis". |
The prefix
As we can see, both of these examples cannot be parsed without the parser having to be context aware. |
Can you elaborate a little? I'm focused on this copy elision stuff and so I'm not quite grokking your examples. In the current parser we look for a function call expression directly after |
@andrewrk What we do in both parsers, is look at what type the resulting expression of |
My examples show how the PEG would parse the input: it fails, because we cannot express this check in the grammar.
I have all stage 1 tests passing locally. As I mentioned, we still have these being parsed incorrectly:
I can trivially make the As for Also, In the grammar, For now, I'm gonna do the least amount of effort to make stage 2 able to parse the new
This is not correct. I just found out that it's not valid and my grammar does the semicolon rule correctly (from the tests I've done). Hooray!
Implemented! |
I have all tests passing locally. This is ready to be merged once CI finishes running. If anyone wishes for a I'll make a repo for the grammar soon, and find a way to have it built into the docs.
Can confirm that I can parse all If time permits will annotate my fork of the grammar (in vbpeg tests) with native language actions to produce some kind of parse tree in JSON format for reference. |
I ended up making the fmt passes, as I needed them myself:
|
@Hejsil can you please help me understand how I can use This change breaks a lot of stuff I have been working on, so an automatic conversion would be helpful.
You will need to build the stage 2 compiler from source from these branches. The guide for that is in the README.
thanks |
This is an attempt at formalizing a Parsing Expression Grammar for the Zig programming language. This is done to find a better solution for #760.
Currently, I have the grammar posted here using the peg parser generator to validate it. The grammar is a breaking change from 0.3.0 (See what changed in db5d479).
The grammar implements #1047 and some of #114.